Abstract

Given a discrete distribution, an interesting problem is to determine
the minimum size of a random sample drawn from this distribution
that is needed to observe a given number of different records. This problem
is related to many applied problems, such as Heaps' law in linguistics
and the classical coupon collector's problem. In this note we
compute theoretically the expected size of such a sample, and
we provide an approximation strategy in the case of the Mandelbrot
distribution.

1 Introduction 

Let us consider a text written in a natural language: the Heaps' law is an 
empirical law which describes the portion of the vocabulary which is used 
in the given text. This law can be described by the following formula 

$$R_m(n) = K n^{\beta}$$

where $R_m(n)$ is the number of different words present in a text consisting
of $n$ words and taken from a vocabulary of $m$ words, while $K$ and $\beta$ are
free parameters determined empirically. In order to obtain a formal derivation
of this empirical law, van Leijenhorst and van der Weide in [2] have
considered the average growth in the number of records, when elements are
drawn randomly from some statistical distribution that can assume exactly
$m$ different values. The exact computation of the average number of records
in a sample of size $n$, $E[R_m(n)]$, can be easily obtained using the following
approach. Let $S = \{1, 2, \ldots, m\}$ be the support of the given distribution,
define $X = m - R_m(n)$ as the number of values in $S$ not observed, and denote
by $A_i$ the event that the record $i$ is not observed. It is immediate to see
that $P[A_i] = (1 - p_i)^n$, $X = \sum_{i=1}^{m} 1_{A_i}$, and therefore that

$$E[R_m(n)] = m - E[X] = m - \sum_{i=1}^{m} (1 - p_i)^n . \qquad (1)$$

Assuming now that the elements are drawn randomly from the Mandelbrot
distribution, van Leijenhorst and van der Weide obtain that Heaps' law
is asymptotically true as $n$ and $m$ go to infinity and $n \ll m^{\theta - 1}$, where
$\theta$ is one of the parameters of the Mandelbrot distribution (see [1] for the
details).
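
As a minimal sketch of formula (1), the following Python function evaluates $E[R_m(n)]$ for an arbitrary discrete density. The function name and the example density are illustrative assumptions and are not taken from the original computations.

```python
from typing import Sequence


def expected_records(p: Sequence[float], n: float) -> float:
    """Evaluate formula (1): E[R_m(n)] = m - sum_i (1 - p_i)^n."""
    if abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must sum to 1")
    return len(p) - sum((1.0 - pi) ** n for pi in p)


if __name__ == "__main__":
    # Hypothetical example: uniform density on m = 4 values, sample of size 3,
    # so the value is 4 - 4 * 0.75**3.
    uniform = [0.25] * 4
    print(expected_records(uniform, 3))
```

The cost is linear in $m$, which is why this direction of the problem is the easy one.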

A slightly different problem is as follows: assume that we are interested
in the minimum number $X_m(k)$ of elements that we have to draw randomly
from a given statistical distribution in order to obtain $k$ different records.
This is clearly closely related to the previous problem, and at first sight one
expects that the technical difficulties would be similar. However, this is not
the case: in this note we will prove that the computation of the expectation
of $X_m(k)$ is more complicated and, even if related to other results on the
coupon collector's problem, it is to the best of our knowledge original. The
formula that we obtain is computationally hard, and we are able to perform
the exact computation in the environment R (see [T]) just for distributions
with a support of small cardinality. Our plan for the future is to study
this problem further in order to simplify our formula, at least in some cases
of interest. For now we propose an approximation procedure in the special
case of the Mandelbrot distribution, widely used in applications, making
use of the asymptotic results proven in [3] in order to derive Heaps' law.

The paper is organized as follows: in the second section we derive
the expected number of elements that we have to draw from a given statistical
distribution in order to obtain $k$ different records, and we present
some additional results related to this one. In the third section we
compute this value in the case of the Mandelbrot distribution. Due to the
computational effort required to compute this expectation, we present the
exact value just for $k \le 8$. After comparing our formula with the results
obtained in [3], using the exact values when $m$ is small and the values obtained
by simulation for greater values of $m$, we use their asymptotic results
to propose an approximation to our formula.

2 The expected value of $X_m(k)$

Let us denote by $S = \{1, \ldots, m\}$ the support of a given discrete distribution,
by $p = (p_1, \ldots, p_m)$ its discrete density, and let us assume that the elements
are drawn randomly from this distribution in sequence. The random variables
in the sample are independent, and the realization of each of them
is equal to $k$ with probability $p_k$. Since we are interested here in the
number of draws needed to obtain $k$ different realizations of the
given distribution, let us define the following set of random variables: $X_1$
denotes the (random) number of draws that we need in order to have
the first record (which is trivially equal to 1), $X_2$ is the number of
additional draws that we need to obtain the second record, and so on: for
every $i \le m$, $X_i$ is the number of draws needed to go from the
$(i-1)$-th to the $i$-th different record in the sample. From this description we
obtain that the random number $X_m(k)$ of draws that we need to obtain $k$
different records is equal to $X_1 + \ldots + X_k$ and that $P[X_m(k) < +\infty] = 1$.
We also define the following set of random variables: let $Z_1$ be the type of
the first record observed, $Z_2$ the type of the second different record, and so
on until the type $Z_k$ of the $k$-th record observed in the sample.

Remark 1 The problem that we have described above is very close to the
classical coupon collector's problem, which is usually formalized in a similar
way. In that case the random variable $X_i$ denotes the number of coupons
that we have to buy in order to go from the $(i-1)$-th to the $i$-th different type
of coupon in our collection, and $X_m(m)$ represents the random number of
coupons that we have to collect in order to complete the collection. The
first results, due to De Moivre, Laplace and Euler (see [2] for a comprehensive
introduction to this topic), deal with the case of constant probabilities
$p_k = 1/m$, while the first results on the unequal case have to be ascribed to
Von Schelling (see [5]).

In the case of a uniform distribution, i.e. when $p_k = 1/m$ for any $k \in
\{1, \ldots, m\}$, it is immediate to see that the random variable $X_i$, for $i \in
\{2, \ldots, m\}$, has a geometric law with parameter $(m - i + 1)/m$. The expected
number of draws that we need in order to obtain $k$ different records is
therefore

$$E[X_m(k)] = 1 + \frac{m}{m-1} + \frac{m}{m-2} + \ldots + \frac{m}{m-k+1} . \qquad (2)$$
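
Formula (2) can be evaluated exactly with rational arithmetic. The sketch below is only an illustration: the function name is hypothetical, and Python's `fractions` module is used to avoid rounding.

```python
from fractions import Fraction


def expected_draws_uniform(m: int, k: int) -> Fraction:
    """Evaluate formula (2): sum over j = 0, ..., k-1 of m / (m - j)."""
    if not 1 <= k <= m:
        raise ValueError("need 1 <= k <= m")
    return sum((Fraction(m, m - j) for j in range(k)), Fraction(0))


if __name__ == "__main__":
    # Hypothetical example: a fair die (m = 6), all six faces (k = 6).
    print(expected_draws_uniform(6, 6))  # 147/10
```

For $k = m$ this is the classical value $m (1 + 1/2 + \ldots + 1/m)$ of the coupon collector's problem.
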
When the probabilities $p_k$ are unequal, the results available for the coupon
collector's problem are of little help. Indeed, the literature considers only the
problem of completing the collection, i.e. in our case of observing all the $m$
records. The corresponding result, first proven by Von Schelling in [5], can be
obtained in a simple and elegant way if we look at this problem from a slightly
different point of view (see e.g. [2]). Let us define the following set of random
variables: $Y_1$ denotes the (random) number of items that we need to collect to
obtain the first coupon of type 1, $Y_2$ the number of items that we need to collect
to get the first coupon of type 2, and so on for the other coupons. In this setting,
the waiting time to complete the collection is given by the random variable
$Y = \max\{Y_1, \ldots, Y_m\}$. In order to compute its expected value, one can use
the maximum-minimums identity (see [2], p. 345), obtaining

$$E[Y] = \sum_{i} E[Y_i] - \sum_{i<j} E[\min(Y_i, Y_j)] + \sum_{i<j<k} E[\min(Y_i, Y_j, Y_k)] - \ldots + (-1)^{m+1} E[\min(Y_1, Y_2, \ldots, Y_m)] .$$

Since the random variables $\min(Y_{i_1}, Y_{i_2}, \ldots, Y_{i_k})$ have a geometric law with
parameter $p_{i_1} + p_{i_2} + \ldots + p_{i_k}$, one gets the formula

$$E[Y] = \sum_{i} \frac{1}{p_i} - \sum_{i<j} \frac{1}{p_i + p_j} + \sum_{i<j<k} \frac{1}{p_i + p_j + p_k} - \ldots + (-1)^{m+1} \frac{1}{p_1 + \ldots + p_m} . \qquad (3)$$
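
Formula (3) translates directly into a sum over all non-empty subsets of types. The following is a minimal sketch with a hypothetical function name; it visits $2^m - 1$ subsets, so it is only practical for small $m$.

```python
from itertools import combinations
from typing import Sequence


def expected_collection_time(p: Sequence[float]) -> float:
    """Evaluate formula (3) by inclusion-exclusion over subsets of types."""
    m = len(p)
    total = 0.0
    for size in range(1, m + 1):
        # Subsets of odd size enter with a plus sign, even size with a minus.
        sign = 1.0 if size % 2 == 1 else -1.0
        for subset in combinations(p, size):
            total += sign / sum(subset)
    return total


if __name__ == "__main__":
    # Hypothetical example: three equally likely types, 3 * (1 + 1/2 + 1/3).
    print(expected_collection_time([1 / 3] * 3))
```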

In order to compute $E[X_m(k)]$ for any $k < m$, this elegant approach is
no longer useful. Therefore we have to go back to the first setting and try to
compute directly the expected values of the random variables $X_1, X_2, \ldots, X_k$.
In the case of unequal probabilities, the law of the random variables $X_i$
is no longer so simple and, in order to compute their expected values, we
first have to compute their conditional expected values given the types of
the preceding $i-1$ different records obtained. To simplify the notation,
let us define $p(i_1, \ldots, i_k) = 1 - p_{i_1} - \ldots - p_{i_k}$ for $k < m$ and different indexes
$i_1, i_2, \ldots, i_k$. The main result of this section is the following proposition:

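Proposition 2 For any $k \in \{2, \ldots, m\}$, the expected value of $X_k$ is equal
to

$$E[X_k] = \sum_{i_1 \ne i_2 \ne \ldots \ne i_{k-1} = 1}^{m} \frac{p_{i_1} p_{i_2} \cdots p_{i_{k-1}}}{p(i_1)\, p(i_1, i_2) \cdots p(i_1, \ldots, i_{k-1})} \qquad (4)$$

and therefore

$$E[X_m(k)] = \sum_{s=1}^{k} E[X_s] = 1 + \sum_{i_1 = 1}^{m} \frac{p_{i_1}}{p(i_1)} + \sum_{i_1 \ne i_2 = 1}^{m} \frac{p_{i_1} p_{i_2}}{p(i_1)\, p(i_1, i_2)} + \ldots + \sum_{i_1 \ne i_2 \ne \ldots \ne i_{k-1} = 1}^{m} \frac{p_{i_1} p_{i_2} \cdots p_{i_{k-1}}}{p(i_1)\, p(i_1, i_2) \cdots p(i_1, \ldots, i_{k-1})} . \qquad (5)$$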

Remark 3 When $k = m$, expression (5) represents an alternative computation
of the expected number of coupons needed to complete a collection.
The expressions (3) and (5) are different, and a direct combinatorial proof
of their equivalence seems by no means trivial. From a computational point
of view, the second formula is heavier than the first one. In any
case, neither of them is computable in practice for large values of $k$.
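
A brute-force evaluation of formula (5) can be sketched as follows. The function name is hypothetical, and the enumeration over ordered choices of distinct indexes is the most literal reading of the sums in (4) and (5), not an optimized method. The number of terms grows like $m!/(m-k+1)!$, which illustrates why the formula is heavy.

```python
from itertools import permutations
from typing import Sequence


def expected_draws(p: Sequence[float], k: int) -> float:
    """Evaluate formula (5): E[X_m(k)] = 1 + sum over s = 2..k of E[X_s]."""
    m = len(p)
    if not 1 <= k <= m:
        raise ValueError("need 1 <= k <= m")
    total = 1.0  # E[X_1] = 1
    for s in range(2, k + 1):
        # Formula (4): sum over ordered choices of s - 1 distinct indexes.
        for idx in permutations(range(m), s - 1):
            term = 1.0
            remaining = 1.0  # equals p(i_1, ..., i_j) after j indexes
            for i in idx:
                remaining -= p[i]
                term *= p[i] / remaining
            total += term
    return total


if __name__ == "__main__":
    # Hypothetical density; with k = m the value should agree with formula (3).
    print(expected_draws([0.5, 0.3, 0.2], 3))
```

For the density $(0.5, 0.3, 0.2)$ and $k = m = 3$, both (3) and (5) give approximately 6.6548, a small numerical check of the equivalence mentioned in Remark 3.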

Proof of Proposition 2 In order to compute the expected value of the variable
$X_k$, we condition with respect to the variables $Z_1, \ldots, Z_{k-1}$,
where $Z_i$, for $i = 1, \ldots, m$, denotes the type of the $i$-th different record
observed. Let us start by evaluating $E[X_2]$: we have immediately that
$X_2 | Z_1 = i$ has a (conditional) geometric law with parameter $1 - p_i = p(i)$
and therefore $E[X_2 | Z_1 = i] = 1/p(i)$. We immediately obtain that

$$E[X_2] = E[E[X_2 | Z_1]] = \sum_{i=1}^{m} E[X_2 | Z_1 = i]\, P[Z_1 = i] = \sum_{i=1}^{m} \frac{p_i}{p(i)} .$$

Let us now take $k \in \{3, \ldots, m\}$: it is easy to see that

$$E[X_k] = E[E[X_k | Z_1, Z_2, \ldots, Z_{k-1}]] = \sum_{i_1 \ne i_2 \ne \ldots \ne i_{k-1} = 1}^{m} E[X_k | Z_1 = i_1, Z_2 = i_2, \ldots, Z_{k-1} = i_{k-1}]\, P[Z_1 = i_1, Z_2 = i_2, \ldots, Z_{k-1} = i_{k-1}] .$$

(Note that $P[Z_i = Z_j] = 0$ for any $i \ne j$.) The conditional law of
$X_k | Z_1 = i_1, Z_2 = i_2, \ldots, Z_{k-1} = i_{k-1}$, for $i_1 \ne i_2 \ne \ldots \ne i_{k-1}$, is that of a geometric
random variable with parameter $p(i_1, \ldots, i_{k-1})$, and its conditional expected
value is equal to $1/p(i_1, \ldots, i_{k-1})$. By the multiplication rule, we get

$$P[Z_1 = i_1, \ldots, Z_{k-1} = i_{k-1}] = P[Z_1 = i_1]\, P[Z_2 = i_2 | Z_1 = i_1] \times \cdots \times P[Z_{k-1} = i_{k-1} | Z_1 = i_1, \ldots, Z_{k-2} = i_{k-2}]$$

(note that, even though the random variables in the sample are independent,
the random variables $Z_i$ are not independent). From its definition we have
that

$$P[Z_1 = i_1] = p_{i_1} ,$$

while a simple computation gives, for any $s = 2, \ldots, k-1$, that

$$P[Z_s = i_s | Z_1 = i_1, \ldots, Z_{s-1} = i_{s-1}] = \frac{p_{i_s}}{1 - p_{i_1} - \ldots - p_{i_{s-1}}}$$

if $i_1 \ne i_2 \ne \ldots \ne i_{k-1}$ and zero otherwise. Recalling the compact notation
$p(i_1, \ldots, i_k) = 1 - p_{i_1} - \ldots - p_{i_k}$, we then get

$$E[X_k] = \sum_{i_1 \ne i_2 \ne \ldots \ne i_{k-1} = 1}^{m} \frac{p_{i_1} p_{i_2} \cdots p_{i_{k-1}}}{p(i_1)\, p(i_1, i_2) \cdots p(i_1, \ldots, i_{k-1})}$$

and the proof is complete.

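Remark 4 In the case of a uniform distribution, i.e. when $p_i = 1/m$ for
any $i \in S$, we have

$$\frac{p_{i_1} p_{i_2} \cdots p_{i_{k-1}}}{p(i_1)\, p(i_1, i_2) \cdots p(i_1, i_2, \ldots, i_{k-1})} = \frac{1}{(m-1)(m-2) \cdots (m-k+1)} .$$

Since the sum in (4) runs over $m(m-1) \cdots (m-k+2)$ ordered choices of
distinct indexes, this gives $E[X_k] = m/(m-k+1)$. It is therefore immediate
to prove that the expression (5) coincides in this case with (2).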

3 Approximation of the expected value 

The exact formula we obtained in the previous section is nice, but it is
tremendously heavy to compute as soon as the cardinality of the support
of the distribution becomes larger than 10. The number of all possible
ordered choices of indexes involved in (5) increases very fast with $k$,
leading to objects that are hard to handle on a personal computer. For this reason
it would be important to be able to approximate this formula, at least in
some cases of interest, even if its complicated structure suggests that
this could be quite difficult in general. In this section we consider the
case of the Mandelbrot distribution, which is commonly used in connection with
Heaps' law and other practical problems. Applying the results proved in [1], we
present here a possible strategy to approximate the expectation of $X_m(k)$,
together with some numerical results to test our procedure.

Let us consider $R_m(n)$ and $X_m(k)$: these two random variables are closely
related, since $\{R_m(n) \ge k\} = \{X_m(k) \le n\}$ for $k \le n \le m$. However, we have
seen that the computation of their expected values is quite different. With
an abuse of language, we could say that the two functions $n \mapsto E[R_m(n)]$ and
$k \mapsto E[X_m(k)]$ are one the "inverse" of the other. In order to support
this statement, let us consider the case studied in [3], i.e. let us assume that we
sample from the Mandelbrot distribution. Fixing three parameters $m \in \mathbb{N}$,
$\theta \in [1, 2]$ and $c > 0$, we assume that $S = \{1, \ldots, m\}$ and

$$p_i = a_m (c + i)^{-\theta} , \quad i \in S , \qquad a_m = \Big( \sum_{i=1}^{m} (c + i)^{-\theta} \Big)^{-1} . \qquad (6)$$

We implement both the expressions (1) and (5) using the environment R (see
PQ). We set the parameters of the Mandelbrot distribution to $c = 0.30$
and $\theta = 1.75$. Using (5), we compute the expected number $E[X_m(k)]$ of elements
we have to draw randomly from a Mandelbrot distribution in order
to obtain $k$ different records, for three levels of $m$, where $m$ is the vocabulary
size, i.e. the maximum number of different words. In brackets we show the expected
number of different words in a random selection of exactly $E[X_m(k)]$
elements, computed using (1). Results are collected in Table 1. We see
that the number of different words we expect in a text of size
$E[X_m(k)]$ is close to the value of $k$, and this supports our statement about
the connection between $E[R_m(n)]$ and $E[X_m(k)]$. As underlined before, we
can compute these expectations only for small values of $k$.
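
The exact computations were carried out in R; the following Python sketch only illustrates how one column of such a table could be reproduced. It assumes that the Mandelbrot density has the form $p_i \propto (c + i)^{-\theta}$ for $i = 1, \ldots, m$, and all function names are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence


def mandelbrot_density(m: int, c: float, theta: float) -> List[float]:
    """Density with p_i proportional to (c + i)^(-theta), i = 1..m (assumed form)."""
    weights = [(c + i) ** (-theta) for i in range(1, m + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]


def expected_records(p: Sequence[float], n: float) -> float:
    """Formula (1); n may be non-integer, as for the bracketed values."""
    return len(p) - sum((1.0 - pi) ** n for pi in p)


def expected_draws(p: Sequence[float], k: int) -> float:
    """Formula (5), by brute-force enumeration of ordered index choices."""
    total = 1.0  # E[X_1] = 1
    for s in range(2, k + 1):
        for idx in permutations(range(len(p)), s - 1):
            term, remaining = 1.0, 1.0
            for i in idx:
                remaining -= p[i]  # running value of p(i_1, ..., i_j)
                term *= p[i] / remaining
            total += term
    return total


if __name__ == "__main__":
    # Parameters stated in the text: c = 0.30, theta = 1.75, here with m = 5.
    p = mandelbrot_density(m=5, c=0.30, theta=1.75)
    for k in range(2, 6):
        size = expected_draws(p, k)
        print(k, round(size, 2), round(expected_records(p, size), 2))
```

Under these assumptions the first rows for $m = 5$ should be close to the values 2.80 (1.97) and 6.08 (2.87) reported for $k = 2$ and $k = 3$.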





Rows: number of different words $k$. Columns: vocabulary size $m$.

         m = 5           m = 8           m = 10
k = 2    2.80 (1.97)     2.63 (2.00)     2.57 (2.01)
k = 3    6.08 (2.87)     5.17 (2.95)     4.93 (2.97)
k = 4    12.42 (3.76)    9.01 (3.90)     8.31 (3.92)
k = 5    28.46 (4.59)    14.81 (4.84)    13.04 (4.88)
k = 6    -               23.95 (5.77)    19.68 (5.84)
k = 7    -               39.96 (6.69)    29.21 (6.80)
k = 8    -               77.77 (7.55)    43.66 (7.74)

Table 1: Expected text size in order to have $k$ different words taken from a
vocabulary of size $m$



At the same time, since $E[R_m(n)] \le m$, it is clear that our statement
that $n \mapsto E[R_m(n)]$ and $k \mapsto E[X_m(k)]$ are one the "inverse" of the
other can be valid just for values of $k$ small with respect to $m$. This idea
arises also from Table 1, but in order to confirm it we shall compare the
two functions for larger values of $m$. Since our formula is not computable
for values larger than 10, we perform a simulation to obtain its approximate
values. In Figure 1 we compare the values of the two functions for
$m = 100$ and for values of $k$ ranging from 1 to $m$. Again, we suppose the
elements are drawn from a Mandelbrot distribution with the same values of
$c$ and $\theta$. The two functions are close up to $k = 90$, while for larger values
of $k$ the distance between the two values increases. Thanks to these results,
we propose the following approximation strategy: the main result proven in
[1] is that

$$E[R_m(n)] \sim \alpha n^{\beta}$$

when $n, m \to \infty$ with validity region $n \ll m^{\theta - 1}$, where $\beta = \theta^{-1}$ and
$\alpha = a_\infty^{\beta}\, \Gamma(1 - \beta)$, where $a_\infty = \lim_{m \to \infty} a_m$ (see the expression (6)). Assuming
that for values of $n \ll m^{\theta - 1}$ the functions $n \mapsto E[R_m(n)]$ and $k \mapsto E[X_m(k)]$
are one the "inverse" of the other, we get

$$E[X_m(k)] \approx (k / \alpha)^{1/\beta} = (k / \alpha)^{\theta}$$

with validity region $k \ll \tau$, where $\tau$ is the approximate value of $k$ for which
$E[X_m(k)] = m^{\theta - 1}$. In order to test our approximation scheme, we take
the same values of the constants as before, $m = 500$, $k = 1, \ldots, 60$. Figure
2 shows the results: we obtain a very good correspondence between the
simulated values and the approximation curve in the range of applicability
$k \ll 25$.
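
The comparison between simulated values and the approximation can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the figures: it assumes the Mandelbrot density $p_i \propto (c + i)^{-\theta}$, it reads the asymptotic relation as $E[R_m(n)] \approx \alpha n^{\beta}$ with $\beta = 1/\theta$ and $\alpha = a_\infty^{\beta}\, \Gamma(1 - \beta)$, and it approximates $a_\infty$ by a truncated sum plus an integral estimate of the tail (valid only for $\theta > 1$). All function names are hypothetical.

```python
import math
import random
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence


def mandelbrot_density(m: int, c: float, theta: float) -> List[float]:
    """Density with p_i proportional to (c + i)^(-theta), i = 1..m (assumed form)."""
    weights = [(c + i) ** (-theta) for i in range(1, m + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]


def simulate_draws(p: Sequence[float], k: int, trials: int,
                   rng: random.Random) -> float:
    """Monte Carlo estimate of E[X_m(k)]: mean number of draws until k distinct values."""
    cum = list(accumulate(p))
    total = 0
    for _ in range(trials):
        seen = set()
        draws = 0
        while len(seen) < k:
            u = rng.random() * cum[-1]
            # Inverse-CDF sampling; min() guards against a rounding edge case.
            seen.add(min(bisect_right(cum, u), len(p) - 1))
            draws += 1
        total += draws
    return total / trials


def heaps_alpha(c: float, theta: float, terms: int = 200_000) -> float:
    """alpha = a_inf^beta * Gamma(1 - beta), with a_inf approximated numerically."""
    beta = 1.0 / theta
    partial = sum((c + i) ** (-theta) for i in range(1, terms + 1))
    # Integral estimate of the remaining terms of the series (needs theta > 1).
    tail = (c + terms + 0.5) ** (1.0 - theta) / (theta - 1.0)
    a_inf = 1.0 / (partial + tail)
    return a_inf ** beta * math.gamma(1.0 - beta)


def approx_draws(k: float, c: float, theta: float) -> float:
    """Approximation E[X_m(k)] ~ (k / alpha)^theta, meant for k well below tau."""
    return (k / heaps_alpha(c, theta)) ** theta


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, only for reproducibility of the sketch
    p = mandelbrot_density(m=500, c=0.30, theta=1.75)
    for k in (5, 10, 20):
        print(k, simulate_draws(p, k, 2000, rng), approx_draws(k, 0.30, 1.75))
```

With $c = 0.30$ and $\theta = 1.75$ this gives $\alpha \approx 1.66$, so that $\alpha\, (m^{\theta - 1})^{\beta} \approx 24$ for $m = 500$, in line with the range of applicability $k \ll 25$ quoted above.
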
Figure 1: Comparison between $E[X_m(k)]$ (filled red circles) and $E[R_m(n)]$
(solid black circles) for $m = 100$ and $k = 1, \ldots, 100$ (main figure). Zoom:
comparison between $E[X_m(k)]$ (solid red line) and $E[R_m(n)]$ (dashed black
line) for $k = 1, \ldots, 80$ (left) and $k = 81, \ldots, 100$ (right)

Figure 2: Comparison between $E[X_m(k)]$ and Heaps' law